On The Robustness of a Neural Network
With the development of neural-network-based machine learning and its use in
mission-critical applications, voices are rising against the "black box"
aspect of neural networks, as it becomes crucial to understand their limits
and capabilities. With the rise of neuromorphic hardware, it is even more
critical to understand how a neural network, as a distributed system,
tolerates the failures of its computing nodes (neurons) and of its
communication channels (synapses). Experimentally assessing the robustness
of neural networks is a quixotic venture: it requires testing all possible
failures on all possible inputs, which hits a combinatorial explosion for
the former and the impossibility of gathering all possible inputs for the
latter.
In this paper, we prove an upper bound on the expected error of the output
when a subset of neurons crashes. This bound involves dependencies on the
network parameters that may appear too pessimistic in the average case: a
polynomial dependency on the Lipschitz coefficient of the neurons'
activation function, and an exponential dependency on the depth of the
layer where a failure occurs. We back up our theoretical results with
experiments illustrating the extent to which our prediction matches the
dependencies between network parameters and robustness. Our results show
that the robustness of neural networks to the average crash can be estimated
without either testing the network on all failure configurations or
accessing the training set used to train it, both of which are practically
impossible requirements.
Comment: 36th IEEE International Symposium on Reliable Distributed Systems,
26-29 September 2017, Hong Kong, China
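The exponential dependency on depth can be illustrated with a toy calculation. This is a sketch under our own simplifying assumptions (a fully connected network where each layer scales a perturbation by at most K * n * w), not the paper's actual bound; all names here are illustrative.

```python
# Toy model (an assumption, not the paper's bound): each layer can amplify
# a perturbation introduced by a crashed neuron by at most K * n * w, where
# K is the Lipschitz coefficient of the activation function, n the layer
# width, and w a bound on the synaptic weights.

def worst_case_amplification(K, w, n, depth):
    """Worst-case factor by which a unit perturbation grows after
    traversing `depth` layers."""
    return (K * n * w) ** depth

# When K * n * w > 1, a perturbation that traverses more layers is
# exponentially more damaging:
print(worst_case_amplification(K=1.0, w=0.5, n=4, depth=1))  # 2.0
print(worst_case_amplification(K=1.0, w=0.5, n=4, depth=3))  # 8.0
```

The per-layer factor compounds multiplicatively, which is why the bound picks up an exponential dependency on how many layers the failure's perturbation crosses, and only a polynomial one on the Lipschitz coefficient itself.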
The Hidden Vulnerability of Distributed Learning in Byzantium
While machine learning is going through an era of celebrated success,
concerns have been raised about the vulnerability of its backbone: stochastic
gradient descent (SGD). Recent approaches have been proposed to ensure the
robustness of distributed SGD against adversarial (Byzantine) workers sending
poisoned gradients during the training phase. Some of these approaches have
been proven Byzantine-resilient: they ensure the convergence of SGD despite the
presence of a minority of adversarial workers.
We show in this paper that convergence is not enough. In high dimension, an adversary can build on the loss function's non-convexity to make
SGD converge to ineffective models. More precisely, we bring to light that
existing Byzantine-resilient schemes leave a margin of poisoning of
Ω(f(d)), where f(d) increases at least like √d.
Based on this leeway, we build a simple attack, and experimentally show its
strong, at times total, effectiveness on CIFAR-10 and MNIST.
We introduce Bulyan, and prove it significantly reduces the attacker's leeway
to a narrow O(1/√d) bound. We empirically show that Bulyan
does not suffer the fragility of existing aggregation rules and, at a
reasonable cost in terms of required batch size, achieves convergence as if
only non-Byzantine gradients had been used to update the model.
Comment: Accepted to ICML 2018 as a long talk
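The aggregation described above can be sketched in a few lines. This is a simplified, hedged reading of the abstract, not the paper's exact algorithm: Bulyan composes with a proven Byzantine-resilient rule (e.g. Krum) for pre-selection, whereas the stand-in below uses a plain sum-of-distances score.

```python
# Bulyan-style aggregation sketch (our simplified reading): pre-select
# n - 2f gradients with a Byzantine-resilient criterion, then take a
# coordinate-wise trimmed average around the median.

def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def select_gradients(grads, f):
    # Stand-in for the recursive Krum step of the actual paper: keep the
    # n - 2f gradients with the smallest total squared distance to all others.
    scores = [sum(sq_dist(g, h) for h in grads) for g in grads]
    order = sorted(range(len(grads)), key=lambda i: scores[i])
    return [grads[i] for i in order[: len(grads) - 2 * f]]

def bulyan_like(grads, f):
    # Assumes n >= 4f + 3 so that beta >= 3 values remain per coordinate.
    selected = select_gradients(grads, f)
    beta = len(selected) - 2 * f
    out = []
    for j in range(len(selected[0])):
        col = sorted(g[j] for g in selected)
        median = col[len(col) // 2]
        # Average the beta values closest to the median of this coordinate.
        closest = sorted(col, key=lambda v: abs(v - median))[:beta]
        out.append(sum(closest) / beta)
    return out

# One poisoned gradient among seven barely shifts the aggregate:
grads = [[1.0, 1.0]] * 6 + [[100.0, -100.0]]
print(bulyan_like(grads, f=1))  # close to [1.0, 1.0]
```

The trimmed-median step is what shrinks the per-coordinate leeway: an attacker can no longer place a single extreme coordinate value and have it survive averaging.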
AKSEL: Fast Byzantine SGD
Modern machine learning architectures distinguish servers from workers. Typically, a d-dimensional model is hosted by a server and trained by n workers, using a distributed stochastic gradient descent (SGD) optimization scheme. At each SGD step, the goal is to estimate the gradient of a cost function. The simplest way to do this is to average the gradients estimated by the workers. However, averaging is not resilient to even a single Byzantine failure of a worker. Many alternative gradient aggregation rules (GARs) have recently been proposed to tolerate a maximum number f of Byzantine workers. These GARs differ according to (1) their computation time complexity, (2) the maximal number of Byzantine workers despite which convergence can still be ensured (breakdown point), and (3) their accuracy, which can be captured by (3.1) their angular error, namely the angle with the true gradient, as well as (3.2) their ability to aggregate full gradients. In particular, many are not full-gradient approaches, for they operate on each dimension separately, which results in a coordinate-wise blended gradient, leading to low accuracy in practical situations where the number s of workers that are actually Byzantine in an execution is small (s ≪ f).
We propose Aksel, a new scalable median-based GAR with optimal time complexity (O(nd)), optimal breakdown point (n > 2f), and the lowest upper bound on the expected angular error (O(√d)) among full-gradient approaches. We also study the actual angular error of Aksel when the gradient distribution is normal and show that it only grows in O(√(d log n)), which is the first upper bound logarithmic in the number of workers n ever proven assuming an optimal breakdown point. We also report on an empirical evaluation of Aksel on various classification tasks, which we compare to alternative GARs against state-of-the-art attacks. Aksel is the only GAR reaching top accuracy when there are actually no or few Byzantine workers, while maintaining a good defense even in the extreme case (s = f). For simplicity of presentation, we consider a scheme with a single server. However, as we explain in the paper, Aksel can also easily be adapted to multi-server architectures that tolerate the Byzantine behavior of a fraction of the servers.
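A median-based filtering of this kind can be sketched as follows. This is a hedged illustration built from the abstract alone: the coordinate-wise median serves as a robust anchor, workers close to it are kept, and their *full* gradients are averaged (so the output is a real gradient, not a coordinate-wise blend). The exact cutoff used here (the median of the distances) is our simplifying assumption, not necessarily the paper's rule.

```python
# Aksel-like median-based GAR sketch (assumptions: the cutoff and the
# plain sort-based median are our own simplifications).

def coordinate_median(grads):
    # Coordinate-wise median of n gradients of dimension d.
    n, d = len(grads), len(grads[0])
    return [sorted(g[j] for g in grads)[n // 2] for j in range(d)]

def aksel_like(grads):
    m = coordinate_median(grads)
    # Squared distance of each worker's full gradient to the median anchor.
    dists = [sum((x - y) ** 2 for x, y in zip(g, m)) for g in grads]
    cutoff = sorted(dists)[len(dists) // 2]  # median distance (assumption)
    kept = [g for g, dist in zip(grads, dists) if dist <= cutoff]
    d = len(grads[0])
    # Average full gradients, so the output is not a coordinate-wise blend.
    return [sum(g[j] for g in kept) / len(kept) for j in range(d)]

# Three honest workers and one Byzantine outlier: the outlier is filtered.
grads = [[0.0, 0.0], [0.1, 0.1], [0.2, 0.2], [10.0, -10.0]]
print(aksel_like(grads))  # close to [0.1, 0.1]
```

Note that the sort-based median above costs O(nd log n); a selection-based median would recover the O(nd) complexity claimed in the abstract.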